HPC-Café 2022-01-18: Using File Systems Properly

Today's topic is using file systems properly.

And for this purpose, we have put together some general information that's also contained in our HPC in a nutshell introduction, which Katrin Nusser will present tomorrow, as usual.

So there's a very quick overview of file systems, what they are, what to do with them. And then we go into the details of how you can avoid putting too much strain on those file systems and slowing down your jobs and other users' jobs.

So the first part will be taken over by Markus Wittmann, and the second part will be presented by me.

Okay, so Markus, if you would.

Yes, thank you, Georg.

So first I will say something about the file systems we have here at RRZE.

So a file system you probably know is something like a directory structure that can store files.

And next point, please Georg.

Yeah, and under Linux or Unix we can mount several file systems on the same node. You probably know this from Windows, if you're coming from there: you have C: or D:, which point to different file systems.

And under Linux, we can also mount a file system anywhere in the root file system.

As I said, at RRZE we have different file systems, and they differ in size, in redundancy, and in what you should use them for.

So next slide, please.

So the first file system you are probably all working with or know is $HOME. This is where your HPC home is located, and its mount point is /home/hpc.

The idea there is to store source code and important results from simulations or other things.

And it's located on central NFS servers.

You have a backup and you have snapshots, which are taken every 30 minutes.

It's available as long as your HPC account is alive, and you have a quota of around 550 gigabytes.

$HPCVAULT is similar; it's located at /home/vault.

The idea there is mid- and long-term storage.

It's also located on central servers. You have a backup, but snapshots are only taken once per day.

It's also tied to your HPC account.

So as long as your HPC account is available, the file system is available, and there you have a quota of 500 gigabytes.

Another file system is $WORK, which, depending on your account, can be located on different underlying file systems.

It can be either /home/woody, /home/saturn, or /home/titan.

And this is the general purpose file system.

It's meant for short- and mid-term storage, small files as well as large files.

It's another NFS file system, but there is no backup and there are no snapshots.

You also have 500 gigabytes of quota.

And if you're working on Meggie, you have another file system available, $FASTTMP, which is a parallel file system.

It's optimized for parallel I/O and especially useful if you do large parallel read and write operations.

There is also no backup and no snapshots.

As for quota, there is no capacity quota in place.

Only the number of inodes is limited.

Also available on the nodes is $TMPDIR.

It points to a node-local or job-local directory, which can be located on an HDD, an SSD, or a RAM disk.

There's no backup and there are no snapshots.

And very importantly, this file system is only available as long as a job is running,

which means that when the job ends, the data you have stored there will be deleted.

So if you want to keep the data or results stored there, then you have to copy them back to one of the other file systems we talked about before.

If you're using TinyGPU, then each node has an SSD, and $TMPDIR points to that SSD.

And you have around, I think, one terabyte up to 5.8 terabytes available, depending on the generation of the node.
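To illustrate that staging workflow, here is a minimal Python sketch. It assumes the batch system exports TMPDIR and WORK as environment variables; the file names input.dat and result.dat are purely hypothetical and stand in for your own data.

    import os
    import shutil
    from pathlib import Path

    # $TMPDIR is set by the batch system and points to node-/job-local storage;
    # $WORK points to the NFS-based general-purpose file system.
    tmpdir = Path(os.environ["TMPDIR"])
    work = Path(os.environ["WORK"])

    # Stage input data from NFS to the fast local disk once at job start.
    local_input = tmpdir / "input.dat"
    shutil.copy(work / "input.dat", local_input)

    # ... run the actual computation against the local copy ...
    local_output = tmpdir / "result.dat"
    local_output.write_text("placeholder for real results\n")

    # Copy results back before the job ends -- $TMPDIR is wiped afterwards.
    shutil.copy(local_output, work / "result.dat")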

OK, then next slide, please, Georg.

Probably one more thing, if the size of the local file system is of concern to you because you have a lot of files and a lot of local data to deal with, then it's probably a good idea to look up the documentation and choose the appropriate node for that purpose.

OK, so don't just assume blindly that there's enough space available on the local disk.

Also, the disk, although it may be large, several terabytes, other users may be using the disk at the same time if the node is shared.

So take care to check that no overflow happens, because this will usually terminate your job in one way or another.
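As a small sketch of such a check, assuming $TMPDIR is set and using an invented requirement of 200 GB, you could verify the free capacity before the job starts writing:

    import os
    import shutil

    # Check the free capacity of the job-local directory before writing large
    # amounts of data; on shared nodes other jobs may use the same disk.
    tmpdir = os.environ.get("TMPDIR", "/tmp")
    usage = shutil.disk_usage(tmpdir)

    needed_bytes = 200 * 1024**3  # hypothetical requirement: 200 GB
    if usage.free < needed_bytes:
        raise SystemExit(
            f"Only {usage.free / 1024**3:.0f} GB free in {tmpdir}, "
            f"need {needed_bytes / 1024**3:.0f} GB -- aborting early."
        )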

So now let's come to the issue: as Georg said, we have already observed phases of high load on the NFS servers.

One situation where this occurs is when there's a job running that handles a large number of files located on one of the NFS file systems.
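One way to relieve the NFS servers in that situation, anticipating the strategies shown later in the talk, is to pack the many small files into a single archive and unpack it on the node-local disk at job start. The following Python sketch assumes TMPDIR and WORK are set; the archive name inputs.tar is invented.

    import os
    import tarfile
    from pathlib import Path

    work = Path(os.environ["WORK"])      # NFS-based file system
    tmpdir = Path(os.environ["TMPDIR"])  # node-/job-local scratch

    # Extracting one large archive is far cheaper for the NFS server than
    # opening thousands of small files individually in every job.
    with tarfile.open(work / "inputs.tar") as tar:
        tar.extractall(tmpdir / "inputs")

    # The job then reads its input from the local copy under tmpdir / "inputs".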

Part of a video series; chapter: HPC Café
Access: open access
Duration: 00:43:15 min
Recording date: 2022-01-18
Uploaded: 2022-01-19 09:36:04
Language: en-US

Speakers: Markus Wittmann and Georg Hager, NHR@FAU

Slides available at https://hpc.fau.de/files/2022/01/2022-01-11-hpc-cafe-file-systems.pdf

We provide some guidelines for handling large collections of files in your batch jobs on our systems. We have observed phases of heavy overload on NFS file servers when certain types of jobs are running. This is caused by jobs which handle data inefficiently (especially data scattered over many thousands of files, but also data that is accessed frequently), thereby slowing down file operations to a crawl. This has an impact on all users, not only those who actually cause the problem.

In this talk we give a quick overview of the available file systems and show you some strategies to avoid such situations by using local disks within the compute nodes instead of the shared NFS servers.

Tags

UNIX/Linux RRZE HPC filesystem NHR@FAU hpc-cafe